We introduce a new conservative test for quantifying the consistency of two or more datasets. 
The test is based on the Bayesian answer to the question, "How much more probable is it that 
all my data were generated from the same model system than if each dataset were generated from 
an independent set of model parameters?". We make explicit the connection between evidence 
ratios and the differences in peak chi-squared values, the latter of which are more widely used and 
more cheaply calculated. Calculating evidence ratios for three cosmological datasets (recent CMB 
data (WMAP, ACBAR, CBI, VSA), SDSS and the most recent SNe Type Ia data) we find that 
concordance is favoured and the tightening of constraints on cosmological parameters is indeed 
justified. 

PACS numbers: 98.80.Cq 



I. INTRODUCTION 

The apparent mutual agreement of a wide range of 
cosmological observations has led to the current climate 
of "concordance" in cosmology. 
The practice of combining independent datasets, by the 
multiplication of their associated likelihood functions, in 
order to increase the precision of the parameters of the 
world model is now standard, but quantitative consis- 
tency checking is emphasised to a much lesser degree. 
As all physicists will agree, accurate cosmology is prefer- 
able to precision cosmology, and it is this that motivates 
this short communication. 

The purpose of this work is to demonstrate one appli- 
cation of Bayesian model selection, that of checking that 
the far simpler model of a universal set of parameters for 
modeling all datasets is justified by the data themselves: 
in doing so we make the connection between the Bayesian 
formulation of the problem and the pragmatic approach 
taken at much lower computational cost by the experi- 
mental community. In this work we show that, as is so 
often the case, the standard approach is justified on the 
grounds of common sense, and demonstrate the reduc- 
tion of this common sense to calculation via probability 
theory. 

As usual, the route to model selection is via the 
Bayesian evidence. The evidence for a model H from data 
d is just the probability Pr(d|H), and can be calculated 
in principle by marginalising the unnormalised posterior 
probability distribution function over all M parameters 
in the model: 

$$ \Pr(d|H) = \int \Pr(d|\theta, H)\, \Pr(\theta|H)\, d^M\theta. \qquad (1) $$

In practice, calculating this integral is rarely feasible, but 
other techniques exist to provide estimates of the 
evidence. More detailed introductions 
to the evidence and its central role in the problem of 
model selection are available elsewhere; here 
we make the general remarks that the evidence increases 
sharply with increasing goodness-of-fit, and decreases 
with increasing model complexity (quantifying the princi- 
ple of Occam's razor) . We show later explicitly how these 
two aspects come to the surface and, for the specific case 
of Gaussian measurement errors, result in model selec- 
tion proceeding by the comparison of differences in the 
ubiquitous chi-squared statistic with an "Occam" factor 
which takes the form of an effective number of parameters. 
The more general approach advocated here is applicable 
to any likelihood function $\Pr(d|\theta, H)$, not just 
those having Gaussian form, and takes into account the 
full extent of the pdfs involved. It is of course also sensitive 
to the parameters' prior pdf $\Pr(\theta|H)$: broader priors 
represent more complex models and so naturally give 
lower evidence values. Evidence is the natural tool for 
comparing datasets in this way: it enables us to quantify 
such questions as "Is the mismatch between two experi- 
ments large enough to warrant investigation into possible 
sources of systematics or new physics?" 
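
As a rough illustration of Equation (1) and of the Occam penalty described above, the following Python sketch evaluates the evidence integral by brute-force quadrature for a toy one-parameter problem. The Gaussian likelihood, its peak position and width, the two prior ranges and all function names are assumptions made for this example; none of them is taken from the analysis in this paper.

```python
import math

def gaussian_likelihood(theta, mean, sigma):
    """Toy likelihood Pr(d|theta, H): a unit-area Gaussian in theta (assumed form)."""
    return math.exp(-0.5 * ((theta - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def evidence_uniform_prior(mean, sigma, half_width, n_steps=20000):
    """Midpoint-rule estimate of Pr(d|H) for a uniform prior on [-half_width, half_width]."""
    step = 2.0 * half_width / n_steps
    prior_density = 1.0 / (2.0 * half_width)
    total = 0.0
    for k in range(n_steps):
        theta = -half_width + (k + 0.5) * step
        total += gaussian_likelihood(theta, mean, sigma) * prior_density
    return total * step

# Same data and same best fit, but two different prior widths (both hypothetical).
narrow = evidence_uniform_prior(mean=0.3, sigma=0.1, half_width=1.0)
broad = evidence_uniform_prior(mean=0.3, sigma=0.1, half_width=10.0)
print(narrow, broad, narrow / broad)
```

Because the likelihood sits well inside both priors, the goodness-of-fit is identical in the two cases and the evidence simply scales inversely with the prior width: the tenfold broader prior lowers the evidence by a factor of ten.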

The simplest model for all the cosmological data in 
hand is that they provide information on the same set of 
cosmological parameters: this is the standard assumption 
made in all the joint analyses to date. Let $H_0$ represent 
the hypothesis that "there is one set of parameters that 
describes our cosmological model." In other words, we 
believe that we understand both cosmology and our ex- 
periments to the extent that there should be no further 
freedom beyond the parameters specified. However, if 
we are interested in accuracy as well as precision then we 
should take care to allow for systematic differences between 
datasets: the most extreme case would be the one 
where the observations were in such strong disagreement 
that they appeared to give conflicting measurements of 
all the model parameters. In this case one could consider 
the hypothesis $H_1$ that "there is a different set of parameters 
for modeling each dataset." The conservatism of 
such a model comparison exercise is readily apparent: the 
large increase in model complexity incurred when moving 
from $H_0$ to $H_1$ means that the joint analysis is intrinsically 
favoured. This means that any result in 
favour of $H_1$ may be taken as a clear indication of discord 
between the two experiments. Note also that this test is 
easily done given that the evidence values will have been 
calculated for alternative purposes, such as comparing 
two physical models in the light of each dataset alone. 

For checking dataset consistency then the quantity we 
should calculate is the ratio of probabilities that each 
model is correct, given the data: 



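$$ \frac{\Pr(H_0|d)}{\Pr(H_1|d)} = \frac{\Pr(d|H_0)\,\Pr(H_0)}{\Pr(d|H_1)\,\Pr(H_1)}. \qquad (2) $$

The calculable part of Equation (2) is the evidence ratio 

$$ \begin{aligned} R &= \frac{\Pr(d|H_0)}{\Pr(d|H_1)} \\ &= \frac{\Pr(d|H_0)}{\prod_i \Pr(d_i|H_1)}, \end{aligned} \qquad (3) $$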



where in the second line we have assumed that the individual 
datasets $d_i$ under analysis are independent. (The 
evidence integral factorises since the independent likelihoods 
do, and also because each likelihood depends only 
on its own subset of parameters.) Interpretation of this 
evidence ratio is aided by Equation (2): for statement $H_0$ 
to be more likely to be true than statement $H_1$, the product 
of R and the prior probability ratio must be greater 
than one. Suppose that an evidence ratio R of 0.1 were 
found: the dataset combination ($H_0$) can still be justified, 
but only if you are willing to take odds of ten to one 
on there being no significant systematic errors in the system. 
Blindly multiplying N likelihoods together results, 
in general and approximately, in an improvement 
in precision by a factor of $\sqrt{N}$: the evidence ratio gives an 
indication of whether or not this improvement is justified, in 
the form of an odds ratio (which enforces honesty through 
the threat of bankruptcy). 
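
The arithmetic behind this odds argument can be sketched in a few lines of Python. The function names and the log-evidence values below are hypothetical and serve only to illustrate how R combines with the prior odds in Equation (2); natural logarithms are assumed throughout.

```python
import math

def log_evidence_ratio(log_e_joint, log_e_separate):
    """log R: joint log-evidence minus the sum of the individual log-evidences."""
    return log_e_joint - sum(log_e_separate)

def posterior_odds(log_r, prior_odds=1.0):
    """Posterior odds for the joint hypothesis: R times the prior probability ratio."""
    return math.exp(log_r) * prior_odds

def break_even_prior_odds(log_r):
    """Prior odds at which the posterior odds are exactly one."""
    return math.exp(-log_r)

# The example from the text: an evidence ratio R = 0.1.
log_r_example = math.log(0.1)
needed = break_even_prior_odds(log_r_example)
print("prior odds needed to justify combining:", needed)

# Hypothetical log-evidences for a joint fit and two separate fits.
log_r_toy = log_evidence_ratio(-105.0, [-60.0, -43.0])
print("toy log R:", log_r_toy, "posterior odds at even prior odds:", posterior_odds(log_r_toy))
```

With R = 0.1 the break-even prior odds come out at ten to one, which is the bet described above.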

Other criteria besides evidence have been used to compare 
different models. Recently, [13] proposed the 
Akaike and Bayesian information criteria to carry out 
cosmological model selection. These criteria are approximations 
to the full Bayesian evidence under rather restrictive 
assumptions and thus fall under the same framework. 
The posterior Bayes factors proposed by [11] and 
also discussed in [3] can be used as an alternative to evidence. 
This quantity is the Bayesian evidence with the 
prior set to the posterior and can be readily estimated as 
the average likelihood over the Markov Chain Monte Carlo 
chains. It has some desirable properties, such as no prior 
dependence in the limit of the prior enclosing the entire volume 
of the posterior. However, it has no simple interpretation 
within the Bayesian framework and will thus not be 
discussed in this paper. The use of the evidence itself 
as a model selection tool has been growing in cosmology 
(see e.g. [13]). Using evidence to check dataset 
consistency has received much less attention. An application 
to the particular problem of CMB map contamination 
can be found in [16]. In this work we construct a much 
more general approach that can be applied to any setting 
in which a given model is tested against more than 
one dataset. The price one has to pay for this generality 
is that we are relatively insensitive to any particular 
inconsistency. We also aim to provide a short tutorial, 
establishing the connection with the more conventional $\chi^2$ 
statistics, followed by a simple analysis of current 
state-of-the-art experiments. 



II. CONNECTION TO $\chi^2$ ANALYSIS 

Consider a general likelihood function of some model 
parameter vector x, which can be (for reasons that will 
become apparent in a moment) rewritten as 



$$ L(x) = L_{\max}\, \hat{L}(x), \qquad (4) $$

where $L_{\max}$ is the likelihood at the most likely point in 
the parameter space and the dimensionless function $\hat{L}(x)$ 
contains all the likelihood shape information. Assuming 
a uniform prior spanning between $-p$ and $p$ in each direction, 
where $p$ is large enough to encompass the whole region 
of high likelihood, gives the approximate evidence 



$$ E \approx L_{\max}\, \frac{\int \hat{L}\, d^M x}{(2p)^M}, \qquad (5) $$

where M is the number of parameters in the model. If 
we identify the numerator of the above fraction with the 
volume associated with the likelihood, $V_L$, and the denominator 
with the available prior volume, $V_\pi$, we have 



$$ \log E = \log \frac{V_L}{V_\pi} + \log L_{\max}. \qquad (6) $$



All the details of the overlap between prior and likelihood 
are contained within the volume ratio, whereas the maximum 
likelihood value specifies the goodness-of-fit. Except 
when the posterior pdfs take simple analytic forms, 
this volume factor must be calculated numerically and of 
course takes up much of the effort in the evidence calculation. 

In the case where the measurement errors are Gaus- 
sian, we can write the evidence ratio used in this work 
in terms of the best-fit chi-squared values that may be 
calculated during an analysis. It can be shown that 



$$ \log R = \log \frac{V_{12} V_\pi}{V_1 V_2} - \frac{\Delta\chi^2}{2}, \qquad (7) $$

where $\Delta\chi^2 = \chi^2_{12} - (\chi^2_1 + \chi^2_2)$. Defined this way, $\Delta\chi^2$ is 
always positive (the goodness-of-fit cannot decrease with 
the addition of the extra parameters) and we see that the 
borderline case of $\log R = 0$ corresponds to the difference 
in chi-squared between the two individual analyses and 
the joint fit being equal to an effective number of parameters 
(difference in number of degrees of freedom) given 
by twice the logarithm of the volume factor. 
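
For readers who want the intermediate steps, the following is one way to obtain the relation between log R and the chi-squared difference from Equation (6). It assumes that the same prior volume $V_\pi$ applies to the joint fit and to each individual fit, and that for Gaussian errors the peak likelihoods can be written as $L_{\max,i} = N_i \exp(-\chi^2_i/2)$ and $L_{\max,12} = N_1 N_2 \exp(-\chi^2_{12}/2)$, so that the normalisation constants $N_i$ cancel in the ratio. The subscripted symbols are notation introduced here for the sketch.

```latex
\begin{align}
\log E_{12} &= \log\frac{V_{12}}{V_\pi} + \log L_{\max,12}, \qquad
\log E_i = \log\frac{V_i}{V_\pi} + \log L_{\max,i} \quad (i = 1, 2), \\
\log R &= \log E_{12} - \log E_1 - \log E_2 \\
       &= \log\frac{V_{12} V_\pi}{V_1 V_2}
          + \log\frac{L_{\max,12}}{L_{\max,1}\, L_{\max,2}} \\
       &= \log\frac{V_{12} V_\pi}{V_1 V_2}
          - \frac{1}{2}\left[\chi^2_{12} - \left(\chi^2_1 + \chi^2_2\right)\right].
\end{align}
```

Setting $\log R = 0$ in the last line gives the borderline condition: the chi-squared difference equals twice the logarithm of the volume factor.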

TABLE I: The priors assumed for the cosmological model 
considered in this paper. The notation (a, b) for parameter x 
denotes a top-hat prior in the range a < x < b. 

Basic parameter      Prior 
[...]                (0.005, 0.05) 
[...]                (0.01, 0.4) 
[...]                (-0.3, 0.3) 
h                    (0.4, 0.9) 
n_s                  (0.8, 1.2) 
τ                    (0.01, 0.7) 
log(10^10 A_s)       (1.5) 

Returning to the general case, if we retain the assumption 
of a broad uniform prior, and if the likelihoods are 
well approximated by multivariate Gaussians, then the 
volume factor can be calculated analytically: in this case 
the $i$-th likelihood can be written as 

$$ L_i \approx L_{\max,i} \exp\left[ -\frac{1}{2} (x - x_i)^T F_i\, (x - x_i) \right], \qquad (8) $$

where $F_i$ is the Fisher matrix. This gives, for the likelihood 
volumes, 

$$ V_i = (2\pi)^{M/2}\, |F_i|^{-1/2}. \qquad (9) $$

In the joint analysis, combining two Gaussian likelihoods 
results in a new Gaussian, centred at a correctly weighted 
mean of positions $x_{12}$, but whose shape is given simply by 

$$ \hat{L}_{12} = \exp\left[ -\frac{1}{2} (x - x_{12})^T (F_1 + F_2)\, (x - x_{12}) \right], \qquad (10) $$

and therefore 

$$ V_{12} = (2\pi)^{M/2}\, |F_1 + F_2|^{-1/2}. \qquad (11) $$



Note that in this case, due to the high symmetry of 
the Gaussian approximation, the overlap integral $V_{12}$ is 
independent of the distance between best fitting points. 
Therefore using $\Delta\chi^2$ as a proxy for the Bayesian evidence 
change is valid when the Gaussian approximation to the 
posterior is a good one. In the simple case where $F_1 = F_2 = F$ 
(a parallel degeneracy) and $V_\pi = (2p)^M$ again, the log 
evidence ratio becomes: 

$$ \log R = M \log \left[ \frac{p\, |F|^{1/(2M)}}{\sqrt{\pi}} \right] - \frac{\Delta\chi^2}{2}. \qquad (12) $$

The log term is typically of the order of unity: $|F|^{-1/M}$ 
is the geometrical average of the principal variances and 
hence $p^2 |F|^{1/M}$ is the square of the ratio of the prior 
width to the characteristic likelihood width. Hence we recover 
the frequentist rule of thumb that the increase in $\chi^2$ 
is justified if the number of parameters drops by roughly 
the same number. However the evidence considerations 
above allow this rule to be calibrated to take into account 
both the prior information supplied and the (potentially 
complex) shape of the likelihoods; in general, $V_{12}$ is not 
independent of the individual peak positions, and so the 
simple $\Delta\chi^2$ procedure does not propagate all the information 
contained within the likelihood functions. 
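
The Gaussian results of this section can be checked numerically with a short Python sketch. It assumes diagonal Fisher matrices (stored as lists of their diagonal entries) purely to avoid a linear-algebra dependency, uses Gaussian volumes $V = (2\pi)^{M/2} |F|^{-1/2}$ and a prior volume $(2p)^M$, and compares the general expression $\log R = \log[V_{12} V_\pi / (V_1 V_2)] - \Delta\chi^2/2$ with the equal-Fisher-matrix form $M \log[p |F|^{1/(2M)} / \sqrt{\pi}] - \Delta\chi^2/2$. The peak positions, Fisher entries and function names are hypothetical.

```python
import math

def log_volume(fisher_diag):
    """log of the Gaussian likelihood volume (2*pi)^(M/2) |F|^(-1/2) for diagonal F."""
    m = len(fisher_diag)
    return 0.5 * m * math.log(2 * math.pi) - 0.5 * sum(math.log(f) for f in fisher_diag)

def delta_chi2(peak1, fisher1, peak2, fisher2):
    """Rise in best-fit chi-squared when two Gaussian likelihoods share one parameter set."""
    return sum((a - b) ** 2 * f1 * f2 / (f1 + f2)
               for a, b, f1, f2 in zip(peak1, peak2, fisher1, fisher2))

def log_r_gaussian(peak1, fisher1, peak2, fisher2, prior_half_width):
    """General Gaussian log R from the volume factor and the chi-squared difference."""
    m = len(peak1)
    joint = [f1 + f2 for f1, f2 in zip(fisher1, fisher2)]
    log_prior_volume = m * math.log(2 * prior_half_width)
    log_volume_factor = (log_volume(joint) + log_prior_volume
                         - log_volume(fisher1) - log_volume(fisher2))
    return log_volume_factor - 0.5 * delta_chi2(peak1, fisher1, peak2, fisher2)

def log_r_equal_fisher(fisher_diag, prior_half_width, dchi2):
    """Special case F1 = F2 = F: M log(p |F|^(1/2M) / sqrt(pi)) - dchi2 / 2."""
    m = len(fisher_diag)
    log_det = sum(math.log(f) for f in fisher_diag)
    return m * math.log(prior_half_width / math.sqrt(math.pi)) + 0.5 * log_det - 0.5 * dchi2

# Hypothetical two-parameter example with identical Fisher matrices.
fisher = [100.0, 25.0]                       # widths of 0.1 and 0.2
peak_a, peak_b = [0.0, 0.0], [0.1, 0.2]      # peaks one sigma apart in each direction
dchi2 = delta_chi2(peak_a, fisher, peak_b, fisher)
general = log_r_gaussian(peak_a, fisher, peak_b, fisher, prior_half_width=1.0)
special = log_r_equal_fisher(fisher, 1.0, dchi2)
print(dchi2, general, special)
```

In this toy setup the two routes agree to rounding error, and the result is positive: a one-sigma offset in each parameter is not enough to outweigh the Occam factor.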



III. COMPARING COSMOLOGICAL DATASETS 
A. Datasets and method 

We use a version of the CosmoMC software package [13], 
modified to calculate evidence by the thermodynamic integration 
method. We obtain consistent results using two 
different methods to calculate the evidence reliably: the 
error on the log evidence differences is conservatively estimated 
to be of the order of unity. The details of the 
evidence calculation method are presented elsewhere [10]. 
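
To give a feel for what thermodynamic integration computes, here is a minimal one-parameter Python sketch. It is not the modified CosmoMC implementation: the averages over the "tempered" posteriors are done by grid quadrature instead of MCMC sampling, and the Gaussian likelihood, prior range, grid sizes and function names are all assumptions made for this example. The identity used is $\log E = \int_0^1 \langle \log L \rangle_\beta \, d\beta$, where the average is taken over the distribution proportional to $L^\beta$ times the prior.

```python
import math

def log_like(theta, mean=0.3, sigma=0.1):
    """Toy log-likelihood with unit peak value (assumed Gaussian form)."""
    return -0.5 * ((theta - mean) / sigma) ** 2

def prior_grid_log_likes(half_width, n_theta):
    """log L on a midpoint grid; equal weights represent the uniform prior."""
    step = 2.0 * half_width / n_theta
    return [log_like(-half_width + (k + 0.5) * step) for k in range(n_theta)]

def mean_log_like(beta, log_likes):
    """<log L> under the tempered posterior proportional to L^beta times the prior."""
    weights = [math.exp(beta * ll) for ll in log_likes]
    return sum(w * ll for w, ll in zip(weights, log_likes)) / sum(weights)

def log_evidence_ti(half_width=1.0, n_theta=400, n_beta=200):
    """Trapezoid rule in t, with beta = t^4 to resolve the steep region near beta = 0."""
    log_likes = prior_grid_log_likes(half_width, n_theta)
    h = 1.0 / n_beta
    total = 0.0
    for k in range(n_beta + 1):
        t = k * h
        integrand = mean_log_like(t ** 4, log_likes) * 4.0 * t ** 3
        total += (0.5 if k in (0, n_beta) else 1.0) * integrand
    return total * h

def log_evidence_direct(half_width=1.0, n_theta=400):
    """Direct prior average of the likelihood on the same grid, for comparison."""
    log_likes = prior_grid_log_likes(half_width, n_theta)
    return math.log(sum(math.exp(ll) for ll in log_likes) / n_theta)

ti = log_evidence_ti()
direct = log_evidence_direct()
print(ti, direct)
```

The two estimates should agree closely in this toy problem. In a realistic setting the inner average at each temperature would come from a sampled chain, which is where most of the computational cost lies.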

We have chosen three datasets for comparison: 

• CMB: We use the "standard" selection of CMB 
experiments: the WMAP data together with the 
latest VSA [21], CBI [1] and ACBAR data. 
We also used a modified version of the likelihood 
code that correctly accounts for the largest WMAP 
scales. 

• SN: We use the Riess et al. (2004) SN data. We use 
both "gold" and "silver" datasets. We implemented 
our likelihood code and checked that it gives results 
consistent with Riess et al. 

• SDSS: Finally we use large scale power spectrum 
measurements from the SDSS experiment. 
We used the likelihood code by 
Tegmark [5] adapted for CosmoMC by Samuel Leach 
(private communication). 

We investigate a 7-parameter cosmological model. In 
Table I we show the uniform priors assumed for the parameters 
of our model. We take our priors to be comparatively 
broad to approximate the state of ignorance 
we may have been in before any of the three experiments 
were performed. This has the effect of giving the data as 
much "freedom" as possible, and correspondingly making 
the evidence test somewhat conservative. 



IV. RESULTS AND DISCUSSION 

TABLE II: The logarithm of R for various combinations of 
datasets. See text for further discussion. 

Dataset combination    log R 
CMB - SDSS             0.23 
SDSS - SN              1.5 
SN - CMB               1.6 
CMB - SDSS - SN        4.5 

In Table II we give the values of log R for various combinations 
of datasets under discussion. We do not detect any 
discrepancy between datasets: all combinations of the 
datasets weakly favour $H_0$. In the last line of Table II 
we report the value of log R for all experiments combined. 
In principle, it is possible for three experiments to be 
pair-wise consistent with each other, but not when all 
combined together (imagine, for example, three degeneracy 
lines forming a triangle). Comfortingly enough, the 
three-way evidence test also disfavours $H_1$ and, owing to the 
larger number of extra parameters (i.e. twice as many as 
in the pairwise comparisons), it gives a more positive detection of 
concordance. 
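
The triangle scenario mentioned above can be made concrete with a small Python sketch. It builds three hypothetical two-parameter Gaussian likelihoods, each elongated along one side of an equilateral triangle, and evaluates log R for each pair and for all three together using the Gaussian volume and chi-squared expressions. For N datasets the volume factor is taken to be $V_{\rm joint} V_\pi^{N-1} / \prod_i V_i$. The widths, prior range, geometry and function names are assumptions chosen for illustration only.

```python
import math

SIGMA_ALONG, SIGMA_ACROSS = 1.0, 0.05      # assumed widths along and across each line
PRIOR_HALF_WIDTH = 5.0                     # assumed uniform prior on [-p, p] per parameter
INRADIUS = 1.0 / (2.0 * math.sqrt(3.0))    # lines form an equilateral triangle of unit side

def make_dataset(normal_angle_deg):
    """Return (peak, fisher) with fisher = (Fxx, Fxy, Fyy) for one degeneracy line."""
    psi = math.radians(normal_angle_deg)
    nx, ny = math.cos(psi), math.sin(psi)  # unit normal to the line
    ux, uy = -ny, nx                       # unit vector along the line
    w_along, w_across = SIGMA_ALONG ** -2, SIGMA_ACROSS ** -2
    fisher = (w_along * ux * ux + w_across * nx * nx,
              w_along * ux * uy + w_across * nx * ny,
              w_along * uy * uy + w_across * ny * ny)
    return (INRADIUS * nx, INRADIUS * ny), fisher

def det(f):
    return f[0] * f[2] - f[1] * f[1]

def log_volume(f):
    """log of the 2-D Gaussian volume 2*pi |F|^(-1/2)."""
    return math.log(2 * math.pi) - 0.5 * math.log(det(f))

def log_r(datasets):
    """log R for N Gaussian datasets sharing one parameter set versus N separate sets."""
    n = len(datasets)
    fa = sum(f[0] for _, f in datasets)
    fb = sum(f[1] for _, f in datasets)
    fc = sum(f[2] for _, f in datasets)
    joint = (fa, fb, fc)
    # Joint chi-squared minimum: sum_i x_i^T F_i x_i - b^T F^-1 b, with b = sum_i F_i x_i.
    bx = sum(f[0] * x[0] + f[1] * x[1] for x, f in datasets)
    by = sum(f[1] * x[0] + f[2] * x[1] for x, f in datasets)
    quad = sum(f[0] * x[0] ** 2 + 2 * f[1] * x[0] * x[1] + f[2] * x[1] ** 2
               for x, f in datasets)
    delta_chi2 = quad - (fc * bx * bx - 2 * fb * bx * by + fa * by * by) / det(joint)
    log_prior_volume = 2 * math.log(2 * PRIOR_HALF_WIDTH)
    log_volume_factor = (log_volume(joint) + (n - 1) * log_prior_volume
                         - sum(log_volume(f) for _, f in datasets))
    return log_volume_factor - 0.5 * delta_chi2

a, b, c = make_dataset(90.0), make_dataset(210.0), make_dataset(330.0)
pairwise = [log_r([a, b]), log_r([b, c]), log_r([c, a])]
triple = log_r([a, b, c])
print(pairwise, triple)
```

In this toy geometry every pair of lines crosses at a vertex, so each pairwise log R is positive, while no single point lies close to all three lines and the three-way log R is strongly negative.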

We have illustrated our methodology with application 
to real cosmological data. As expected, the data are con- 
cordant: any obvious conflict in the data would likely 
have been noticed using the "chi by eye" methods employed 
to date. However, should such discrepancies occur 
in the future it is imperative to have a method to quantify 
these discrepancies in the most general settings where 
Gaussianity cannot be assumed and ever more complex 
parameter spaces are to be dealt with. 

A value of R less than unity (log R < 0) is a sign that 
we should investigate the mismatch between datasets further. 
This can be done by exploring more focussed models, 
either with new cosmological parameters (if the experiments 
are reckoned to be well-understood), or with 
additional nuisance parameters that quantify the possible 
systematic errors in the data. Disentangling the degen- 
eracy between new physics and systematic error can only 
be done if the additional parameters come with fresh in- 
formation encoded in their prior pdf: this information 
is then folded into the evidence ratio, providing the cru- 
cial difference between this methodology and any method 
relying on goodness-of-fit alone. 